For many real-world networks only a small "sampled" version of the original network may be
investigated; those results are then used to draw conclusions about the actual system. Variants of 
breadth-first search (BFS) sampling, which are based on epidemic processes, are widely used. Although
it is well established that BFS sampling fails, in most cases, to capture the IN-component(s)
of directed networks, a description of the effects of BFS sampling on other topological properties
is all but absent from the literature. To systematically study the effects of sampling biases on
directed networks, we compare BFS sampling to random sampling on complete large-scale directed 
networks. We present new results and a thorough analysis of the topological properties of seven dif- 
ferent complete directed networks (prior to sampling), including three versions of Wikipedia, three 
different sources of sampled World Wide Web data, and an Internet-based social network. We detail 
the differences that sampling method and coverage can make to the structural properties of sampled 
versions of these seven networks. Most notably, we find that sampling method and coverage affect 
both the bow-tie structure and the number and structure of strongly connected components in
sampled networks. In addition, at low sampling coverage (i.e. less than 40%), the values of average 
degree, variance of out-degree, degree auto-correlation, and link reciprocity are overestimated by 
30% or more in BFS-sampled networks, and only attain values within 10% of the corresponding 
values in the complete networks when sampling coverage is in excess of 65%. These results may 
cause us to rethink what we know about the structure, function, and evolution of real-world directed 
networks. 

PACS numbers: 89.75.Hc, 89.75.Da, 02.10.Ox



I. INTRODUCTION 

In the last decade, a flood of research on systems that
can be represented as networks has revealed that most
differ markedly from simple random graph models.
For example, many exhibit a broad, or "scale-free" degree
distribution, making them robust to random failures
but rendering them vulnerable to targeted attacks.
Complex networks research also offers a framework
for representing biological processes such as gene regulation,
protein-protein interactions, and even connections
between diseases and symptoms. Our
growing understanding of how epidemics spread on networks
has led to parallel insights into the propagation of
information, fashions, ideas, and fads [19]. Complex
network studies have incorporated features such as the
weight and direction of links to describe systems more
precisely. Link directionality plays a particularly important
role in dynamics, as small changes to link structure
can completely change the dynamics on a network
[23]. Thus, capturing the directional structure is essential
to understanding the dynamics on directed networks,
much as the connectivity structure is essential to dynamic
processes on undirected networks.

One impediment, however, is that it is difficult, if
not impossible, to obtain a complete list of links for
many networks, including, for example, the World Wide
Web (WWW) or large-scale gene-regulatory networks.
The Web changes so quickly that by the time one
could have covered it, it would be substantially transformed
[24], [25]. Even if one somehow managed to completely
map its structure at some point in time, analyzing
such a large network, estimated to contain at least
19 billion pages [25], would present further impediments.
There is no way to avoid biases because the Web can only
be sampled by following directional hyperlinks, which
leaves portions inaccessible. Furthermore, as has been
found to be the case with sampled, undirected networks,
the sampled Web's appearance might even fundamentally
change depending on the type of sampling method
used [26], [23]. If we are to have a clear and reliable picture
of large-scale directed networks and their statistical properties,
it is important that we quantitatively understand
the effects of sampling biases on the properties of interest,
as well as why such biases arise. Insight into these questions
stands to impact structure-exploiting search and
ranking algorithms, such as Google's PageRank [28], [29],
and may cause us to rethink what we know about the
structure, function, and evolution of real-world networks.

Up to now the statistical properties of sampled undirected
networks have been investigated in several papers.
Stumpf et al. [30], [31] studied the degree distribution of
two random networks - one that had been sampled "uniformly"
by picking nodes at random, and one that was
subject to connectivity-dependent sampling - both analytically
and numerically. Lee et al. [32] also studied,
numerically, the effects of random sampling and snowball
sampling [33] on statistical properties of real scale-free
networks - including degree distribution exponents, betweenness
centrality exponents, assortativity, and clustering
coefficient - demonstrating that these quantities
could be either overestimated or underestimated, depending
on the fraction of the network sampled and the
type of sampling method used. Kurant et al. [34], [35] provided
a detailed analytic treatment of measurement bias
due to random sampling and breadth-first search (BFS)
sampling [36], showing that, for example, BFS sampling
can lead to overestimation of average degree, and random
sampling to underestimation.

In the case of directed networks, however, even though
the structural properties of many directed real-world networks
have been examined [37]-[41], much less work has
addressed the statistical properties that result from sampling
them [42]. The majority of numerical studies
in these works were not performed at a level of rigor
sufficient to produce meaningful statistics. Also, the researchers
began their studies with already-biased network
data, since the data were obtained using a web crawler.

For these reasons, we investigate the biases induced by 
sampling large-scale directed networks starting with com- 
plete networks that differ from one another structurally. 
Several structural properties such as average degree, vari- 
ance in degree, degree auto-correlation, reciprocity, as- 
sortativity, and component structure - all of which are 
defined in later sections - are analyzed to give a more 
complete picture of sampling-induced biases. Because we 
know the full, final structure we can accurately measure 
how systematic errors in measured quantities are affected 
by sampling coverage and sampling method. We find that 
the earlier conclusions in [32]-[35] regarding biases in average
degree of undirected networks due to random sampling
and BFS sampling also hold for directed networks.
On the other hand, in direct contradiction with [42], we
conclude that both random sampling and BFS sampling 
overestimate edge reciprocity in the networks we study. 
We show that both sampling methods overestimate de- 
gree auto-correlation, sometimes by nearly 400%. In ad- 
dition, we find that random sampling and BFS sampling 
affect the variance of in- and out-degree differently: both 
are underestimated by random sampling while variance 
in in-degree is underestimated and variance in out-degree 
overestimated by BFS sampling. Finally, we expand on 
the work in [42] by providing a thorough examination of
component sizes and abundances under random sampling 
and BFS sampling. 

In Sec. II we define the large-scale structural properties
of directed networks, the so-called "bow-tie" structure
[44], and introduce the complete networks studied
in this work. We also provide a detailed accounting of
the sampling methods used. In Sec. III we present results
for BFS sampling and uniform, random sampling on
these networks and, where possible, provide arguments 
regarding how sampling can lead to measurement bias. 
We systematically study how accuracy is affected by the 
fraction of the network sampled in the two cases. Finally, 
summary and concluding remarks are given in Sec. IV.



II. PROPERTIES OF DIRECTED NETWORKS 
A. Directed Networks and Bow-Tie Structure 

If one explores a directed network by following links,
some portions of the network are reachable while other
portions may not be. It might be possible to go from one
site to another, while the return journey is impossible.
This results in a component picture of a directed network,
as shown in Fig. 1. We can define a set of nodes
among which a path both to and from all other nodes
in the set exists. This is a strongly connected component
(SCC) [45]. A directed network can be decomposed into
SCCs if isolated nodes, or nodes with only a single incoming
or outgoing link, are considered to be their own
SCC. Then Tarjan's algorithm [45] can be easily modified
to identify the network's SCCs. The largest SCC is
called the giant strongly connected component (GSCC),
and corresponds to the knot in the so-called bow-tie structure
[44]. However, we can also ignore link directionality
and identify the sets of nodes that are connected.
These are weakly connected components (WCC). A fragmented
network may contain several WCCs; the largest
of these is called the giant weakly connected component
(GWCC) [13], and the WCC which contains the GSCC
is defined as the primary weakly connected component
(PWCC). Usually the PWCC is identical to the GWCC.

The out-component(s) (OUT) of a network are found 
by starting from the GSCC and following outgoing 
FIG. 1. (Color online) Schematic of components and surfaces
of a directed network. The giant strongly connected
component (GSCC) and the in- and out-components (IN and 
OUT) account for the "bow-tie" structure. Together with 
tendrils (TEND), these components form the primary weakly 
connected component (PWCC). Portions of the network that 
are not connected to the PWCC are disconnected components 
(DISC). Nodes of the GSCC that are directly connected to 
nodes of IN or OUT form the surface of the GSCC; the nodes 
in IN or OUT to which GSCC surface nodes connect form the 
IN and OUT surfaces. 
TABLE I. Summary of component ratios for the data sets. $N_0$ is the total number of nodes in the
networks, $N_{scc}$ is the number of SCCs in the networks, and the component ratios give the percentage
of nodes placed in each component. The surface nodes ratio is the percentage of surface nodes.

Wiki2007   3,512,462   1,148,923   67.2   1.4   31.4   0.1   46.5   62.4




links. All those nodes that can be reached from the
GSCC but that do not have paths back are part of an
out-component. Conversely, all nodes that can reach
the GSCC following directed links, but that cannot be
reached from it, form the in-component(s) (IN) of the network.
The IN and OUT correspond to the two wings of
the bow-tie, shown in Fig. 1. All other nodes that are
in the PWCC but that are not themselves part of the
GSCC, IN, or OUT form tendrils (TEND). (Note that
our definition of TEND is not the same as in previous
works [42], as we include within tendrils what they
call tubes - direct bridges between IN and OUT.) Any
other nodes in the network must be disconnected from
the PWCC and are therefore said to belong to disconnected
components (DISC). The GSCC connects with IN and
OUT through surfaces of these components. The GSCC-surface
consists of the nodes in the GSCC that share
links with nodes in IN or OUT components; nodes in IN
that adjoin the GSCC form the IN-surface; and the nodes
in OUT that abut the GSCC form the OUT-surface. The
set of nodes in the GSCC, excluding the surface nodes,
is its core (see Fig. 1). Cores for IN and OUT
can also be defined. Broder et al. reported that 30%
of their sampling of the WWW is GSCC, while IN and
OUT each have roughly 23% of the nodes. This type
of result varies strongly from network to network. Such
differences between many real-world, directed networks
are pointed out in the next section.
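
The component definitions above can be illustrated with a short sketch. This is not the authors' implementation: it uses Kosaraju's two-pass algorithm rather than the modified Tarjan algorithm mentioned in the text, and the names `strongly_connected_components`, `reach`, and `bow_tie`, as well as the toy graph, are hypothetical. It assumes every link endpoint appears in `nodes`, and it follows the convention of this paper in counting tubes as part of TEND.

```python
from collections import deque


def reverse(nodes, succ):
    """Predecessor lists of a graph given as node -> list of successors."""
    pred = {v: [] for v in nodes}
    for u in nodes:
        for v in succ.get(u, ()):
            pred[v].append(u)
    return pred


def strongly_connected_components(nodes, succ):
    """Kosaraju's two-pass algorithm, written iteratively. Returns a list of sets."""
    pred = reverse(nodes, succ)
    order, seen = [], set()
    # Pass 1: record nodes in order of depth-first completion.
    for root in nodes:
        if root in seen:
            continue
        seen.add(root)
        stack = [(root, iter(succ.get(root, ())))]
        while stack:
            u, it = stack[-1]
            for v in it:
                if v not in seen:
                    seen.add(v)
                    stack.append((v, iter(succ.get(v, ()))))
                    break
            else:  # all successors of u handled: u is finished
                order.append(u)
                stack.pop()
    # Pass 2: sweep the reversed graph in decreasing completion order.
    comps, assigned = [], set()
    for root in reversed(order):
        if root in assigned:
            continue
        comp, stack = {root}, [root]
        assigned.add(root)
        while stack:
            u = stack.pop()
            for v in pred[u]:
                if v not in assigned:
                    assigned.add(v)
                    comp.add(v)
                    stack.append(v)
        comps.append(comp)
    return comps


def reach(start, adj):
    """All nodes reachable from the set `start` along the adjacency `adj`."""
    seen, queue = set(start), deque(start)
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen


def bow_tie(nodes, succ):
    """Split a directed network into GSCC, OUT, IN, TEND (tubes included), and DISC."""
    pred = reverse(nodes, succ)
    gscc = max(strongly_connected_components(nodes, succ), key=len)
    out = reach(gscc, succ) - gscc
    in_ = reach(gscc, pred) - gscc
    undirected = {v: list(succ.get(v, ())) + pred[v] for v in nodes}
    pwcc = reach(gscc, undirected)
    return {"GSCC": gscc, "OUT": out, "IN": in_,
            "TEND": pwcc - gscc - out - in_, "DISC": set(nodes) - pwcc}


# Toy graph (hypothetical): a 3-cycle as GSCC, one IN node, one OUT node,
# a tendril "t", a tube "u" bridging IN and OUT, and a disconnected pair.
nodes = ["a", "b", "c", "i", "o", "t", "u", "x", "y"]
succ = {"a": ["b"], "b": ["c"], "c": ["a", "o"],
        "i": ["a", "t", "u"], "u": ["o"], "x": ["y"]}
parts = bow_tie(nodes, succ)
for name, members in parts.items():
    print(name, sorted(members))
```

For networks with millions of nodes, as studied here, a compiled graph library would be a more practical choice; the sketch is only meant to make the set definitions concrete.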



B. Data Sets 

We analyze seven networks: three sampled Web data 
sets from different sources, one complete social net- 
work, and three versions of the entire English language 
Wikipedia network. The Web data are a combined set
of Web pages from the University of California Berkeley
(berkeley.edu) and Stanford University (stanford.edu),
denoted by "BerkStan"; Web pages solely from Stanford
University (stanford.edu), denoted by "Stanford"; and a
set of Web pages released by Google in 2002 as a part of
the Google Programming Contest. All three of these data
sets are available for download from the Stanford Large
Network Dataset Collection. In addition, we have
gathered social network data from an amateur photographers'
website, RaySoda [49], where each node corresponds
to a photographer, and where a directed link from
A to B indicates that A follows B. The largest networks
we analyze are the Wikipedia networks [50] ($\sim O(10^6)$
nodes) - three networks collected at different times (2005,
2006, and 2007). These networks, downloaded from [51],
contain nodes representing five types of Wikipedia page:
articles, categories, portals, disambiguations, and redirects
[52]. The number of nodes in our networks is different
from those in [53], since the networks in [53] contain
only article pages, while ours contain the full collection
of pages in the "main" name space of Wikipedia.

We have selected these seven networks for analysis, not
only because they vary substantially in size, but also because
they have different structural properties. As can
be seen in Fig. 2 and Table I, the relative sizes of components
can span a wide range: the BerkStan data epitomize
the classical bow-tie, with the bulk of nodes residing
in the GSCC and the remainder balanced between IN and
OUT; the Wikipedia networks, on the other hand, display
almost no OUT, but instead show a tendency for
roughly 67% of the nodes to comprise the GSCC, and
the rest, the IN; conversely, the nodes of the Notre Dame
data [54] (which is shown in Fig. 2(c) for comparison,
but is otherwise not analyzed in this paper), depicting
webpages within the nd.edu domain, tend to concentrate
in the OUT, revealing no IN and a GSCC containing less
than 20% of the network's mass. This structure reflects
how that dataset was obtained: webpages were gathered
by crawling outward from a particular starting page.

Remarkably, even with these strong differences in gross
global structure, we find, as shown in the next sections,
many common trends in the effects of sampling biases on
the measured properties of these networks. For all data
sets, basic network properties, including degree distributions,
average degree, variance in in- and out-degrees,
degree auto-correlation, reciprocity, and four types of assortativity,
are determined, and these properties, as well
as component analyses, are defined and presented in corresponding
subsections on sampling. All basic properties
are summarized in Tables I and II, and the values reported
therein are later used for comparison with our
sampling studies. Because it will be necessary to avoid
trivial sampling failures (resulting from, for example, network
disconnectedness), we consider for analysis only the
GWCC of each network, which by virtue of the fact that,
in all cases, it contains more than 90% of the network, is
also the PWCC.

FIG. 2. (Color online) SCC diagram for (a) BerkStan data, (b) Wiki2007, and (c) Notre Dame data [54] (not
otherwise analyzed, but shown here for its distinctive structure). For better visualization, only the 100 largest SCCs have been
displayed. Each circle corresponds to an SCC, whose size is proportional to the logarithm of the number of nodes in the SCC.
The width and intensity of the color of the directed links are proportional to their weight, and self-links are omitted. This SCC
diagram shows the heterogeneity that can exist for the simple bow-tie diagram.



C. Sampling Methods 

We use two sampling methods: uniform random sam- 
pling and breadth-first search (BFS) sampling. For the 
former, each node is selected independently and with 
equal probability. This method is not feasible on the 
real WWW, but it is a good basis for comparison since 
TABLE II. Summary of the basic network properties for the data sets. $N_0$ is the number of nodes in
the GWCC; $\langle k_{in} \rangle$ ($\langle k_{out} \rangle$) is the average incoming (outgoing) degree, always
$\langle k_{in} \rangle = \langle k_{out} \rangle$; $\sigma^2_{in}$ and $\sigma^2_{out}$
are the variances of in- and out-degree, respectively; $r_a$ is the degree auto-correlation; $R$ is the global
reciprocity; and $r_{ii}$, $r_{io}$, $r_{oi}$, and $r_{oo}$ are the in-degree/in-degree, in-degree/out-degree,
out-degree/in-degree, and out-degree/out-degree assortativities.

RaySoda   17,852
0.202   -0.048   0.048   -0.125   0.093
1,596,970   12.37
0.203   0.122   -0.014   0.017
-0.032
2,935,761
56,821.4   1,095.8   0.196
-0.008   0.014   -0.051

it is analytically tractable and is related to well-known
percolation phenomena [57]. The latter method is more
complex but has been broadly adopted for web crawling,
and so is important to analyze [26], [27], [34], [35]. Starting
from a few randomly-selected nodes (seeds), neighboring
nodes connected by outgoing links are visited at each successive
step, like a process of gossip spreading [19]. At the
outset, the seeds are added to the BFS queue. One at a
time, the outgoing links of these seeds are explored, and
the visited neighbours are added to the queue. We define
the growing front nodes to be those nodes in the sampled
network whose outgoing links have not yet been explored
- i.e. those nodes most recently added to the queue. Before
sampling begins, a targeted sampling coverage - the
fraction of the network one wishes to sample - is also
chosen. When this coverage is reached, the process terminates
and all edges connecting already visited nodes
are included as part of the final sampled network. This
FIG. 3. (Color online) Sampled coverages using two different
sampling methods. While random sampling is able to achieve 
the targeted coverage, in most cases, BFS only covers up to 
the size of the GSCC and OUT. The graph shows the 2007 
Wikipedia data which contains 3,512,462 nodes, and for which 
the combined fraction of nodes in GSCC and OUT is about 
68.5%. 



procedure is analogous to web crawling, initiating with 
several portal pages from which Web pages are iteratively 
gathered. 
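
The two sampling procedures described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the study: the function names `random_sample`, `bfs_sample`, and `induced_edges` are hypothetical, the graph is a plain dictionary from each node to its list of out-neighbours, and the target size is taken to be the rounded product of coverage and network size.

```python
import random
from collections import deque


def induced_edges(sampled, succ):
    """All directed links of the original network whose two endpoints were both sampled."""
    return [(u, v) for u in sampled for v in succ.get(u, ()) if v in sampled]


def random_sample(nodes, succ, coverage, rng):
    """Uniform random sampling: keep each node independently with probability `coverage`."""
    sampled = {v for v in nodes if rng.random() < coverage}
    return sampled, induced_edges(sampled, succ)


def bfs_sample(nodes, succ, coverage, n_seeds, rng):
    """BFS sampling along outgoing links from randomly chosen seeds.

    Stops when the target number of nodes is reached or when the queue
    is exhausted (nothing further is reachable from the seeds).
    """
    target = round(coverage * len(nodes))
    seeds = rng.sample(list(nodes), n_seeds)
    visited = set(seeds)
    queue = deque(seeds)  # nodes still in the queue form the growing front
    while queue and len(visited) < target:
        u = queue.popleft()
        for v in succ.get(u, ()):
            if v not in visited:
                visited.add(v)
                queue.append(v)
                if len(visited) >= target:
                    break
    return visited, induced_edges(visited, succ)


# Toy example (hypothetical): a directed ring of 10 nodes, one seed, 50% coverage.
ring = list(range(10))
ring_succ = {i: [(i + 1) % 10] for i in ring}
visited, edges = bfs_sample(ring, ring_succ, 0.5, 1, random.Random(0))
print(len(visited), len(edges))
```

Because the crawl only follows outgoing links, a seed with no successors stalls immediately, which is the failure mode that motivates using several seeds.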

While BFS sampling will always cover the entire network
in an undirected (connected) network, this is not
the case when BFS is used to sample directed networks.
In the worst case scenario, if one chooses as a starting
node a node with no outgoing links, the procedure cannot
proceed to the next step. We always choose $n = 10$
seed nodes as starting nodes both to decrease the likelihood
of this type of failure and to minimize the effects of
interference between random and BFS sampling. When
we sample $N$ nodes from among the $N_0$ nodes of the
real network in order to achieve a sampling coverage
$\alpha = N/N_0$, randomly selecting $n$ nodes as seeds affects
the sampling properties of BFS, so that as $n \to N$, BFS
sampling simply becomes random sampling.

In this paper, we consider coverages of 0.25% to 100%.
In most cases the sampled coverage matches the target coverage,
as shown in Fig. 3. However, because BFS sampling
gathers new nodes by successively exploring nodes' outgoing
neighbours, its coverage cannot exceed the combined
size of the GSCC and OUT, which may be related to
"reachability" in directed networks. We analyze
all properties of the sampled network as a function of
the sampled coverage $\alpha$. For every coverage, each sampling
method was executed one hundred times on each
network.



D. Sampling Measurements 

We measure the following directed network properties:
average degree $\langle k \rangle$, variances of incoming and outgoing
degrees ($\sigma^2_{in}$, $\sigma^2_{out}$), degree auto-correlation $r_a$, link
reciprocity $R$ [38], and four kinds of assortativity ($r_{ii}$, $r_{io}$,
$r_{oi}$, and $r_{oo}$) [53]. These will be defined in corresponding
subsections. In addition, SCC analyses are performed,
and we study how the SCCs and bow-tie structure change
in response to sampling. For each sampling coverage, we
record the ratios between the sizes of the GSCC, OUT,
IN, TEND, and DISC, as well as how many nodes comprise
these components' surfaces - i.e., their points of
contact [41]. We further measure how, for BFS, the
growing front ratio depends on $\alpha$. All the basic measurements
for the complete data sets are summarized in
Tables I and II as a baseline to compare with sampling
results.



III. SAMPLING RESULTS 



A. Average Degree and Degree Variances 



Each node $i$ in a directed network has a number, $k^i_{in}$,
of incoming links (pointing to the node) and a number,
$k^i_{out}$, of outgoing links (pointing away from the
node). The average total degree of a directed network
is $\langle k \rangle = \langle k_{in} \rangle + \langle k_{out} \rangle$, where
$\langle k_{in} \rangle = N^{-1} \sum_{i \in V} k^i_{in}$, and
similarly for $\langle k_{out} \rangle$. Here $N$ is the number of sampled
nodes in the network and $V$ is the set of sampled nodes.
Of course, $\langle k_{in} \rangle = \langle k_{out} \rangle = \langle k \rangle / 2$. The variances for
in- and out-degrees are $\sigma^2_{in} = N^{-1} \sum_{i \in V} (k^i_{in})^2 - \langle k_{in} \rangle^2$,
and $\sigma^2_{out}$ is similarly defined. Each network has a broad
in-degree distribution and a narrower out-degree distribution.
Therefore all networks exhibit a higher variance
of in-degree than that of out-degree, as indicated in Table II.

For undirected networks, in the case of uniform random
sampling, the sampled degree distribution $p'(k)$ can be
written as

$$ p'(k) = \sum_{k_0 = k}^{\infty} p(k_0) \binom{k_0}{k} \alpha^{k} (1 - \alpha)^{k_0 - k}, \qquad (1) $$

where $p(k)$ is the degree distribution of the original network
and $\alpha$ is the sampled coverage [30]-[32]. Equation (1)
also describes the incoming and outgoing degree
distribution of randomly sampled directed networks,
where $k$ and $k_0$ are replaced with $k_{in}$ and $k_{0,in}$ (or $k_{out}$
and $k_{0,out}$, respectively). The average degree of the sampled
network, $\langle k \rangle'$, is

$$ \langle k \rangle' = \sum_{k=1}^{\infty} k \, p'(k) = \alpha \sum_{k_0=1}^{\infty} k_0 \, p(k_0) = \alpha \langle k \rangle, \qquad (2) $$

where $\langle k \rangle$ is the average degree of the original network.

The variance of the degree under uniform random sampling
is obtained from

$$ \langle k^2 \rangle' = \sum_{k=1}^{\infty} k^2 p'(k) = \alpha^2 \langle k^2 \rangle + \alpha (1 - \alpha) \langle k \rangle, \qquad (3) $$

giving

$$ \sigma'^2 = \langle k^2 \rangle' - (\langle k \rangle')^2 = \alpha^2 \langle k^2 \rangle + \alpha (1 - \alpha) \langle k \rangle - \alpha^2 \langle k \rangle^2 = \alpha^2 \sigma^2 + \alpha (1 - \alpha) \langle k \rangle, \qquad (4) $$

where $\sigma^2$ represents the variance in degree of the original
network. The same formulas, Eqs. (3) and (4), also hold
for the variances of the in- and out-degree, respectively.
Thus $\sigma'^2_{in}$ and $\sigma'^2_{out}$ are both quadratic functions of $\alpha$,
although with different coefficients ($\sigma^2_{in} \neq \sigma^2_{out}$). When the
coverage $\alpha$ is small, $\sigma'^2$ increases linearly with $\alpha$; for large
$\alpha$ it increases quadratically.

This quadratic relation is shown for Wiki2007 in
Fig. 4(b). The gray lines behind the random sampling
data indicate the results calculated from Eq. (4). Since
the variance of the incoming degree is much larger than
the average degree, $\sigma'^2_{in}$ seems to be purely quadratic in
this plot, but the variance of the outgoing degree, $\sigma'^2_{out}$,
shows the transition from a linear to a quadratic function
FIG. 4. (Color online) Behaviors of (a) the average degree
and (b) the variance of the in-degree and out-degree as a
function of sampling coverage, $\alpha$, for each sampling method
in the Wiki2007 data. (a) In the case of BFS sampling, the
average degree approaches its asymptotic value - the average
degree of the combined GSCC and OUT - from above, since
BFS is biased toward high-degree nodes. The average degree
for random sampling is linearly proportional to the sampling
coverage. (b) The variances of in- and out-degree for
random sampling increase quadratically as sampled coverage
increases, but those of BFS approach their real values from
opposite directions. The gray lines behind the random sampling
data are calculated from Eq. (4). Hereafter, all error
bars represent the standard deviation of the measured variables
over sampling realizations.
FIG. 5. (Color online) For BFS sampling of Wiki2007 data,
(a) growing front ratio as a function of coverage and (b) directed
links ratio (the fraction of edges in the growing front
that point outside (circles) or back into (squares) the already
sampled network). Note that even for small coverage a substantial
fraction of the links leaving the growing front point
back to the already sampled network.



as $\alpha$ increases. As can be seen in Fig. 4(b), random sampling
severely underestimates variances of in- and out-degree
(by as much as two orders of magnitude even at
~10% coverage). This underestimation results from the
quadratic dependence of Eq. (4) on $\alpha$.
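
The random-sampling relations for the mean and variance of the sampled degree, $\langle k \rangle' = \alpha \langle k \rangle$ and $\sigma'^2 = \alpha^2 \sigma^2 + \alpha(1-\alpha)\langle k \rangle$, can be checked numerically by applying the binomial thinning of the degree distribution exactly. The sketch below is only an illustration: the truncated power-law distribution `p_original`, the value of `alpha`, and the function names are assumptions chosen for the example, not data from the paper.

```python
from math import comb, isclose


def thinned_distribution(p, alpha):
    """Sampled degree distribution: p'(k) = sum_{k0 >= k} p(k0) C(k0, k) alpha^k (1 - alpha)^(k0 - k)."""
    kmax = len(p) - 1
    return [
        sum(p[k0] * comb(k0, k) * alpha**k * (1 - alpha) ** (k0 - k)
            for k0 in range(k, kmax + 1))
        for k in range(kmax + 1)
    ]


def mean_and_variance(p):
    """Mean and variance of a distribution given as a list indexed by degree."""
    mean = sum(k * pk for k, pk in enumerate(p))
    second = sum(k * k * pk for k, pk in enumerate(p))
    return mean, second - mean**2


# Hypothetical original distribution: p(k0) proportional to k0**-2 for 1 <= k0 <= 50.
weights = [0.0] + [k0 ** -2.0 for k0 in range(1, 51)]
total = sum(weights)
p_original = [w / total for w in weights]

alpha = 0.1
mean0, var0 = mean_and_variance(p_original)
mean1, var1 = mean_and_variance(thinned_distribution(p_original, alpha))

# Both comparisons should hold up to floating-point rounding.
print(isclose(mean1, alpha * mean0))
print(isclose(var1, alpha**2 * var0 + alpha * (1 - alpha) * mean0))
```

At small `alpha` the linear term `alpha * (1 - alpha) * mean0` dominates unless the original variance is much larger than the mean, which is the linear-to-quadratic crossover described in the text.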

As shown in Fig. 4, BFS sampling does not obey
these simple mathematical relationships. Since BFS
follows outgoing links and reaches hub nodes at early
times [32], [42], one could argue that the average
degree of BFS-sampled networks overestimates the
average degree. However, this can only be true when the
networks contain loops. In the case of a tree, since the
network resulting from BFS sampling is still a tree, the
average degree is $2 - 2/N$, very close to $\langle k \rangle = 2 - 2/N_0 \approx 2$.
The average degree of BFS-sampled networks is related
to the loop structures and clustering. Therefore we measure
the size of the growing front under BFS sampling
and the number of directed links pointing into the already
sampled networks, as shown in Figs. 5(a) and (b).
In early stages of BFS sampling, although most nodes lie
in the growing front, the fraction of their links pointing
back to the already sampled nodes is surprisingly high.

BFS sampling also overestimates variance of out-degree,
but underestimates variance of in-degree, as can
be seen in Fig. 4(b). However, these errors are less severe
than for random sampling. Variance of in-degree
is underestimated in BFS sampling for the same reason
it is underestimated in random sampling, although the
misestimations are less severe since the correlated loop
structures affect the directed link ratio of the growing
front, as shown in Fig. 5(b). Variance in out-degree is
overestimated for a different reason: visited nodes have
the same out-degree in the sampled networks as they do
in the original networks, while the out-degree of growing
front nodes is not fully counted. As $\alpha$ increases, the effect
of the growing front nodes diminishes. Indeed, the fraction
of directed (and unexplored) links pointing outside
of the sampled network shrinks quickly even though the
fraction of nodes on the growing front decreases much
more slowly, as shown in Figs. 5(a) and (b).



B. Degree Auto-correlation 

Degree auto-correlation quantifies the extent to which
nodes of high in-degree also have high out-degree, and is
defined as $r_a = \mathrm{Cov}(k_{in}, k_{out}) / \sigma_{in} \sigma_{out}$. The covariance is
given by $\mathrm{Cov}(k_{in}, k_{out}) = N^{-1} \sum_{i \in V} k^i_{in} k^i_{out} - \langle k_{in} \rangle \langle k_{out} \rangle$.
All networks, except for BerkStan and Stanford, have
moderately high degree auto-correlation ($r_a > 0.1$).
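
As an illustration of this definition, the following sketch computes $r_a$ for a network given as a node list and a list of directed links. The function name `degree_autocorrelation` and the two toy graphs are hypothetical; the function assumes every link endpoint is listed in `nodes` and that neither degree variance is zero (otherwise the ratio is undefined).

```python
from math import sqrt


def degree_autocorrelation(nodes, edges):
    """r_a = Cov(k_in, k_out) / (sigma_in * sigma_out) over the given nodes."""
    k_in = {v: 0 for v in nodes}
    k_out = {v: 0 for v in nodes}
    for u, v in edges:
        k_out[u] += 1
        k_in[v] += 1
    n = len(nodes)
    mean_in = sum(k_in.values()) / n
    mean_out = sum(k_out.values()) / n
    cov = sum(k_in[v] * k_out[v] for v in nodes) / n - mean_in * mean_out
    var_in = sum(k * k for k in k_in.values()) / n - mean_in**2
    var_out = sum(k * k for k in k_out.values()) / n - mean_out**2
    return cov / sqrt(var_in * var_out)


# Toy graphs (hypothetical). In the bidirectional star every node has equal
# in- and out-degree; in the outward star the hub only sends links.
star_nodes = [0, 1, 2, 3]
bidirectional_star = [(0, 1), (1, 0), (0, 2), (2, 0), (0, 3), (3, 0)]
outward_star = [(0, 1), (0, 2), (0, 3)]
r_bidirectional = degree_autocorrelation(star_nodes, bidirectional_star)
r_outward = degree_autocorrelation(star_nodes, outward_star)
print(r_bidirectional, r_outward)
```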

In the case of random sampling, the degree auto-correlation,
$r_a$, is unbiased if $\alpha$ is large enough to ensure
an adequate density of links, since the in- and out-degrees
FIG. 6. (Color online) Degree auto-correlation for (a)
Wiki2007 data and (b) Stanford data under both sampling 
methods. BFS sampling significantly overestimates this quan- 
tity. 
TABLE III. Summary of incoming and outgoing degree and average local reciprocity for each component.
Here $\langle \cdot \rangle$ means the average over nodes in each component.

                        Wiki2007                               Google
Component   size (%)   <k_in>   <k_out>   <R_i>     size (%)   <k_in>   <k_out>   <R_i>
GSCC          67.2     18.72    17.76     0.122       50.8      9.51     8.40     0.330
OUT            1.4     15.88     0.04     0.0004      19.4      3.46     1.35     0.292
IN            31.4      0.11     2.84     0.006       21.1      1.47     5.87     0.188
TEND           0.1      1.06     0.44     0.090        8.7      1.25     1.76     0.190
GWCC         100.0     12.82    12.82     0.118      100.0      5.92     5.92     0.306



for each node are sampled randomly. Figures 6(a) and
(b) show this effect, although, for small $\alpha$, some nodes are
isolated and therefore have no in- or out-degree, trivially
causing an increase in degree auto-correlation (Fig. 6(b)).
As can also be seen in Fig. 6, BFS enhances - by up to
400% - degree auto-correlation at low sampling coverage.



C. Reciprocity

The link reciprocity $R$ is defined as the fraction of links
in a network that participate in a two-way relationship,
i.e., $R = L^{\leftrightarrow}/L$, where $L^{\leftrightarrow}$ means the number of edges
belonging to bidirectional connections and $L$ is the total
number of links in the network [38]. For each node $i$, we
can also similarly define a local reciprocity $R_i$, which is
the fraction of node $i$'s edges belonging to bidirectional
connections.

For random sampling, in the absence of self-links (i.e.
links that start and end on the same node), reciprocity
is constant, independent of sampling coverage, since any
pair of nodes is chosen with the same probability as any
other pair of nodes. If, however, self-links are present, the
reciprocity under random sampling is higher than that of
the true network since the self-links (which are reciprocal
by definition) appear with probability $\alpha > \alpha^2$. The
reciprocity with respect to $\alpha$ is $R(\alpha) \approx R(1 + \alpha^{-1} c_l)$,
where $c_l$ is the fraction of self-links among bidirectional
links. Thus, one can see that in the presence of self-links,
the reciprocity is no longer constant, but quickly
approaches its asymptotic value as $\alpha$ increases. The data
in Fig. 7(a) illustrate this effect for Wiki2007, which exhibits
a small fraction (< 0.4%) of self-links among all
bidirectional links. The gray lines in the figures are the
expectation lines from the above equation and agree perfectly
with the data.

For BFS sampling, however, reciprocity is significantly
overestimated, and only slowly approaches its true value.
At least part of the bias for reciprocity under BFS comes
from the fact that we only include links in the growing
front if they point back to the previously sampled graph.
This introduces a bias toward increased reciprocity. This type
of overestimation is actually present at any sampling coverage
for an additional reason: BFS sampling (in most
cases) only gathers information about the GSCC and
OUT components, but not about the IN and other components,
and since reciprocal links always tie two nodes
into one component, there are naturally more reciprocal
links in the GSCC than there are in other components,
as summarized in Table III. Thus there is overrepresentation
of bidirectional links, relative to the total number of
links, and reciprocity is artificially high, as shown most
clearly in Fig. 7(b) for the Google data.

FIG. 7. (Color online) Reciprocity changes as sampled coverage
increases for Wiki2007 (a) and Google (b) data. Except
for an initial state that results from the presence of self-links,
random sampling shows constant reciprocity while BFS sampling
approaches the true value slowly from above. The gray
lines behind the random sampling data are the theoretical
prediction for random sampling. 
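
The reciprocity definition and the self-link effect under random sampling can be illustrated with a short Python sketch. The function names are hypothetical and chosen only for this example. `expected_reciprocity` is not taken from the text: it is a ratio of expected link counts, derived here from the survival probabilities stated above ($\alpha^2$ for ordinary links, $\alpha$ for self-links), and it treats that ratio as a stand-in for the expected reciprocity. `approx_reciprocity` evaluates the small-$c$ approximation quoted in the text.

```python
from typing import Iterable, Set, Tuple

Edge = Tuple[int, int]


def reciprocity(edges: Iterable[Edge]) -> float:
    """R = (links in bidirectional connections) / (all links).

    A self-link (u, u) counts as reciprocal, as in the text.
    """
    edge_set: Set[Edge] = set(edges)  # ignore duplicate links
    if not edge_set:
        return 0.0
    bidirectional = sum(1 for (u, v) in edge_set if (v, u) in edge_set)
    return bidirectional / len(edge_set)


def expected_reciprocity(n_links: int, n_bidir: int, n_self: int,
                         alpha: float) -> float:
    """Ratio of expected link counts under uniform node sampling.

    Assumption of this sketch: an ordinary link survives with probability
    alpha**2 (both end nodes sampled), a self-link with probability alpha.
    n_bidir includes the self-links.
    """
    kept_bidir = alpha**2 * (n_bidir - n_self) + alpha * n_self
    kept_total = alpha**2 * (n_links - n_self) + alpha * n_self
    return kept_bidir / kept_total


def approx_reciprocity(r_true: float, c: float, alpha: float) -> float:
    """Approximation from the text: R(alpha) ~ R * (1 + c / alpha)."""
    return r_true * (1.0 + c / alpha)


if __name__ == "__main__":
    toy = [(1, 2), (2, 1), (2, 3), (3, 3)]
    print("toy reciprocity:", reciprocity(toy))
    for alpha in (0.05, 0.1, 0.5, 1.0):
        exact = expected_reciprocity(1000, 200, 1, alpha)
        approx = approx_reciprocity(0.2, 1 / 200, alpha)
        print(alpha, round(exact, 4), round(approx, 4))
```

With no self-links the count ratio is independent of $\alpha$; with even a single self-link it rises above the true value at small $\alpha$ and relaxes toward it as coverage grows.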
FIG. 8. (Color online) Four kinds of assortativity for BFS-sampled networks and random-sampled ones on (a), (b) Wiki2007
data and (c), (d) BerkStan data. 



D. Assortativities 

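A set of assortativity measures [55] for directed networks is defined using
the Pearson correlation as follows:

$$ r_{xy} = \frac{L^{-1} \sum_{l \in E} \left( j_l^x - \bar{j}^x \right) \left( k_l^y - \bar{k}^y \right)}{s_x^{(j)} \, s_y^{(k)}}, $$

where $x, y \in \{\mathrm{in}, \mathrm{out}\}$ index the incoming and
outgoing degree types, $j_l^x$ ($k_l^y$) is the $x$-degree ($y$-degree) of
the tail (head) node of link $l$, and $E$ is the set of sampled links (all
links, if we consider the complete network).
$\bar{j}^x = L^{-1} \sum_{l \in E} j_l^x$
($\bar{k}^y = L^{-1} \sum_{l \in E} k_l^y$) is the weighted average degree
[56]. The following relations hold in general:
$\bar{j}^{\mathrm{out}} \neq \bar{j}^{\mathrm{in}} = \bar{k}^{\mathrm{out}} \neq \bar{k}^{\mathrm{in}}$.
$s_x^{(j)} = \sqrt{L^{-1} \sum_{l \in E} \left( j_l^x - \bar{j}^x \right)^2}$
is the standard deviation of the $x$-degree of the tail nodes ($j_l^x$);
$s_y^{(k)}$ is similarly defined. It is worth noting that, in general,
$s_{\mathrm{in}}^{(j)} \neq s_{\mathrm{out}}^{(k)}$.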

In most cases, we find that the directed assortativities
of the networks we study are not markedly different from
zero, and it is therefore difficult to identify a general
tendency for the effects of BFS sampling on the statistics
of assortativity. We do, however, point out that both
$r_{oo}$ and $r_{oi}$ of the BerkStan network are quite large (but
have opposite sign), suggesting that, unlike the rest of the 
networks, nodes of high out-degree tend to link to other 
nodes of high out-degree, but nodes of high out-degree 
tend to link with nodes of low in-degree. 
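
As an illustration of how the four directed assortativities can be evaluated, the following Python sketch computes the Pearson correlation over links between the $x$-degree of each tail node and the $y$-degree of each head node. The function names are hypothetical, and degrees are taken from the supplied link list itself (i.e., from the sampled network), which is an assumption of this sketch. A correlation is reported as NaN when either degree sequence has zero variance.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Edge = Tuple[int, int]  # (tail, head): the link points from tail to head


def tail_head_degrees(edges: Sequence[Edge], x: str, y: str
                      ) -> Tuple[List[int], List[int]]:
    """Per-link lists: x-degree of the tail node, y-degree of the head node."""
    degree = {
        "out": Counter(u for u, _ in edges),
        "in": Counter(v for _, v in edges),
    }
    j = [degree[x][u] for u, _ in edges]  # Counter gives 0 if absent
    k = [degree[y][v] for _, v in edges]
    return j, k


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation with population (1/L) normalization."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys)) / n
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in xs) / n)
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in ys) / n)
    if std_x == 0.0 or std_y == 0.0:
        return float("nan")  # correlation undefined without variance
    return cov / (std_x * std_y)


def directed_assortativities(edges: Sequence[Edge]
                             ) -> Dict[Tuple[str, str], float]:
    """Return r_xy for all four (x, y) combinations of 'in' and 'out'."""
    return {
        (x, y): pearson(*tail_head_degrees(edges, x, y))
        for x in ("out", "in")
        for y in ("out", "in")
    }


if __name__ == "__main__":
    toy = [(0, 1), (0, 2), (1, 2)]
    for key, value in directed_assortativities(toy).items():
        print(key, value)
```

The helper `tail_head_degrees` also makes it easy to check the identity between the mean in-degree of tail nodes and the mean out-degree of head nodes on any link list.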

While the small assortativities of the networks make it
dangerous to draw broad conclusions regarding the effects
of sampling, random sampling shows a clear tendency at low
values of the sampling coverage, which seems to be related
to the small reciprocity of the networks we study [57].
The assortativity between the incoming degree and outgoing
degree, $r_{io}$, tends to be overestimated for small sampling
coverage; on the other hand, the incoming-incoming degree
assortativity is underestimated (implying greater
disassortativity than is present in the complete networks),
as shown in Figs. 8(b) and (d). These trends seem to stem from
a trivial situation: when the sampling coverage is small,
many tail (head) nodes will have no incoming (outgoing)
degree, even though they are connected to each other.
Consider two nodes, A and B, connected by a directed
link from A to B. In this case, A has no incoming degree
and B has no outgoing degree. Thus the correlation
between in- and out-degrees would be positive, whereas
the correlation between incoming degrees would be negative.
This would not be the case if a large fraction of
nodes had reciprocal links. Not surprisingly, these
tendencies disappear very quickly as the sampling coverage
increases.

Assortativity can be either overestimated or underestimated
by BFS, depending on network structure and
coverage. The real value of the in-degree/in-degree
assortativity is approached from below in the Wikipedia data (see
Fig. 8(a)). When seeds for BFS are picked at random,
there is a high chance of selecting nodes with small $k_{in}$, since
the incoming degree follows a scale-free distribution. Nonetheless,
BFS sampling soon reaches the nodes with large $k_{in}$. This
results in a highly negative $r_{ii}$ initially. As the sampling
coverage increases, $r_{ii}$ approaches its real value from
below. However, we do not observe systematic behaviors
for other assortativities under BFS. 
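
The two sampling schemes compared throughout can be sketched as follows. This is a minimal illustration, not the exact procedure used for the results above: it assumes that BFS follows outgoing links only, starts from a single seed, stops after a fixed number of nodes, and keeps every link whose two end nodes have both been discovered. The names `bfs_sample`, `random_node_sample`, and `adj_out` are hypothetical.

```python
import random
from collections import deque
from typing import Dict, Iterable, List, Sequence, Set, Tuple

Edge = Tuple[int, int]


def bfs_sample(adj_out: Dict[int, Sequence[int]], seed: int,
               max_nodes: int) -> Tuple[Set[int], List[Edge]]:
    """Breadth-first sample along outgoing links, layer by layer."""
    visited: Set[int] = {seed}
    queue = deque([seed])
    while queue and len(visited) < max_nodes:
        u = queue.popleft()
        for v in adj_out.get(u, ()):
            if v not in visited and len(visited) < max_nodes:
                visited.add(v)
                queue.append(v)
    # Keep only links between discovered nodes (assumption of this sketch).
    edges = [(u, v) for u in visited
             for v in adj_out.get(u, ()) if v in visited]
    return visited, edges


def random_node_sample(nodes: Iterable[int], edges: Iterable[Edge],
                       alpha: float, rng: random.Random
                       ) -> Tuple[Set[int], List[Edge]]:
    """Keep each node independently with probability alpha."""
    kept = {v for v in nodes if rng.random() < alpha}
    return kept, [(u, v) for u, v in edges if u in kept and v in kept]


if __name__ == "__main__":
    # Node 3 only points into the cycle 0 -> 1 -> 2 -> 0.
    adj_out = {0: [1, 2], 1: [2], 2: [0], 3: [0]}
    print(bfs_sample(adj_out, seed=0, max_nodes=10))
```

In the toy graph, node 3 has only an outgoing link into the cycle, so a BFS started inside the cycle never discovers it, whereas random node sampling can select it.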



E. Number of SCCs 

As the sampling coverage increases, the number of
SCCs increases initially. Since single nodes and nodes
with only incoming links are considered SCCs by definition,
the number of SCCs is at first proportional to the sampling
coverage $\alpha$, both for random and BFS sampling.
However, after a certain sampling coverage has been
reached, newly sampled nodes are more likely to connect
to already existing SCCs. For most networks, this
means that existing SCCs will merge together, whence
the total number of SCCs will finally decrease. This is
illustrated in Fig. 9. For both sampling methods, the
number of SCCs increases linearly with $\alpha$ initially and
then decreases to the value in the original networks for
large $\alpha$. However, the number of SCCs observed in BFS
sampling is almost one order of magnitude less. 
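
The rise and subsequent fall in the number of SCCs can be reproduced on small graphs with the sketch below. It counts SCCs (isolated nodes included) in the subgraph induced by a set of sampled nodes. The SCC routine is an iterative version of Kosaraju's two-pass algorithm, chosen here for brevity; it is a stand-in and not necessarily the algorithm used for the results above. All function names are hypothetical.

```python
import random
from typing import Dict, Iterable, List, Set, Tuple

Edge = Tuple[int, int]


def induced_subgraph(kept: Set[int], edges: Iterable[Edge]) -> List[Edge]:
    """Links whose two end nodes are both in the sampled node set."""
    return [(u, v) for u, v in edges if u in kept and v in kept]


def strongly_connected_components(nodes: Iterable[int],
                                  edges: Iterable[Edge]) -> List[Set[int]]:
    """Kosaraju's algorithm with explicit stacks (no recursion)."""
    node_list = list(nodes)
    adj: Dict[int, List[int]] = {v: [] for v in node_list}
    radj: Dict[int, List[int]] = {v: [] for v in node_list}
    for u, v in edges:
        adj[u].append(v)
        radj[v].append(u)
    # Pass 1: record nodes in order of DFS completion.
    order: List[int] = []
    seen: Set[int] = set()
    for s in node_list:
        if s in seen:
            continue
        seen.add(s)
        stack = [(s, iter(adj[s]))]
        while stack:
            node, neighbours = stack[-1]
            for w in neighbours:
                if w not in seen:
                    seen.add(w)
                    stack.append((w, iter(adj[w])))
                    break
            else:  # all neighbours explored
                order.append(node)
                stack.pop()
    # Pass 2: sweep the reversed graph in reverse completion order.
    components: List[Set[int]] = []
    assigned: Set[int] = set()
    for s in reversed(order):
        if s in assigned:
            continue
        component = {s}
        assigned.add(s)
        pending = [s]
        while pending:
            node = pending.pop()
            for w in radj[node]:
                if w not in assigned:
                    assigned.add(w)
                    component.add(w)
                    pending.append(w)
        components.append(component)
    return components


def random_sample_scc_count(nodes: Iterable[int], edges: Iterable[Edge],
                            alpha: float, rng: random.Random) -> int:
    """Number of SCCs after keeping each node with probability alpha."""
    kept = {v for v in nodes if rng.random() < alpha}
    return len(strongly_connected_components(kept, induced_subgraph(kept, edges)))
```

For a directed three-node cycle, sampling two of the nodes yields two single-node SCCs, while adding the third node bridges them into one, which is the merging mechanism described above.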



F. Surface Nodes 

Since surface nodes are in contact with other components,
there is a possibility that they will be absorbed
into component cores or move into other components if
we add further nodes or links from the network. The ratio of
nodes on the surface of a component to the total number
of nodes in the component ('surface node ratio') seems to
depend strongly on the structure of the SCCs of directed
networks. Of the networks we study, the Wikipedia
graphs have the largest GSCCs, with upwards of 67%
of all nodes, and at least 43% of these nodes are surface
nodes. The Stanford and BerkStan networks' GSCCs
are smaller (59% and 51%, respectively) and contain very
few surface nodes (7.4% and 9.6%, respectively). A closer
look at the IN and OUT components of the Stanford network
reveals numerous chains and multinode (directed)
cycles that offer only a single surface node for attachment
to the GSCC. Figure 10 shows the changes to surface
node ratios as sampling coverage increases in the
Wiki2007 data.

FIG. 9. (Color online) Tendency for the number of SCCs
in Wiki2007 (a) and BerkStan (b) data to increase with both
sampling methods. After a threshold sampling coverage has
been reached, the number of SCCs will decrease, since the
newly sampled nodes will bridge preexisting SCCs. The black
dotted lines in (a) and (b) are references with slope 1.

FIG. 10. (Color online) Ratio of the surface nodes in the
GSCC and the PWCC of (a) BFS-sampled and (b) randomly
sampled networks of Wiki2007 data. In the case of BFS
sampling, the estimated values do not approach the true
values, since BFS sampling covers only the nodes in the GSCC
and OUT components.

When the sampling coverage is small, the surface node
ratio in the GSCC does not change markedly under random
sampling. After increasing the sampling coverage,
however, the ratio decreases as the core becomes more
densely connected with the addition of newly sampled
nodes. In contrast, the surface node ratio in the PWCC
increases, as shown in Fig. 10(b), as the DISC and TEND
shrink quickly, becoming absorbed into the GSCC, and
then transforming into surface nodes. 
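
A surface node ratio can be computed along the following lines. The sketch assumes a working definition that may differ in detail from the one used for the figures: a node of a component counts as a surface node if at least one of its links, in either direction, connects it to a node outside that component. The function name is hypothetical.

```python
from typing import Iterable, Set, Tuple

Edge = Tuple[int, int]


def surface_node_ratio(component: Iterable[int],
                       edges: Iterable[Edge]) -> float:
    """Fraction of component nodes with a link to or from the outside."""
    members: Set[int] = set(component)
    if not members:
        return 0.0
    surface: Set[int] = set()
    for u, v in edges:
        # A link crosses the boundary if exactly one end is a member.
        if (u in members) != (v in members):
            surface.add(u if u in members else v)
    return len(surface) / len(members)


if __name__ == "__main__":
    edges = [(0, 1), (1, 2), (2, 0), (2, 3), (4, 0)]
    print(surface_node_ratio({0, 1, 2}, edges))
```

Applied to sampled networks at increasing coverage, a function of this kind yields curves of the type discussed here, given a component decomposition of each sample.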

BFS sampling, on the other hand, shows a different
trend. The surface node ratio in the GSCC is lower than
that in the PWCC. This seems to be closely related to
the fact that BFS sampling starts from seeds and then
expands their territory layer by layer. When the sampling
coverage is small, the surface node ratio increases as the
sampling coverage increases. After the sampling procedure
has reached a certain point, the surface node ratio
begins to decrease, as shown in Fig. 10(a).



G. Components Ratios 

Here we focus both on the evolution of the bow-tie
structure and on the component from which the nodes
are sampled (noted in Fig. 11(c) and (d) as "% sampled
from the orig. comp.") as the sampling coverage
increases. BFS sampling mainly covers the GSCC and
OUT components, so the sizes of the IN and TEND
components in the sampled networks remain constant as $\alpha$
increases. As coverage increases, the size ratio of the
GSCC (the ratio of nodes in the current GSCC to the
total number of discovered nodes) increases slightly as
the GSCC absorbs other components. 
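
Component ratios of this kind can be obtained from a bow-tie decomposition, sketched below under simplifying assumptions. The GSCC is found as the largest intersection of forward and backward reachability sets, which is simple but slower than a linear-time SCC algorithm. OUT (IN) is everything reachable from (able to reach) the GSCC, and all remaining nodes are lumped into a single "OTHER" class rather than being split further into tendrils and disconnected parts. The function names are hypothetical.

```python
from collections import deque
from typing import Dict, Iterable, List, Set, Tuple

Edge = Tuple[int, int]


def reachable(start: Iterable[int], adj: Dict[int, List[int]]) -> Set[int]:
    """All nodes reachable from the start set (start nodes included)."""
    seen: Set[int] = set(start)
    queue = deque(seen)
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen


def bow_tie(nodes: Iterable[int], edges: Iterable[Edge]) -> Dict[str, Set[int]]:
    """Split nodes into GSCC, IN, OUT and OTHER."""
    node_list = list(nodes)
    adj: Dict[int, List[int]] = {}
    radj: Dict[int, List[int]] = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        radj.setdefault(v, []).append(u)
    gscc: Set[int] = set()
    assigned: Set[int] = set()
    for s in node_list:
        if s in assigned:
            continue
        # SCC of s = nodes that s can reach and that can reach s.
        scc = reachable([s], adj) & reachable([s], radj)
        assigned |= scc
        if len(scc) > len(gscc):
            gscc = scc
    out_comp = reachable(gscc, adj) - gscc
    in_comp = reachable(gscc, radj) - gscc
    other = set(node_list) - gscc - out_comp - in_comp
    return {"GSCC": gscc, "IN": in_comp, "OUT": out_comp, "OTHER": other}


def component_ratios(parts: Dict[str, Set[int]]) -> Dict[str, float]:
    """Size of each part divided by the total number of nodes."""
    total = sum(len(p) for p in parts.values())
    return {name: len(p) / total for name, p in parts.items()} if total else {}
```

Running `bow_tie` on a BFS sample and on a random sample of the same coverage, and then calling `component_ratios`, gives quantities comparable to the component ratios discussed in this section.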

The main characteristics associated with random sam- 
pling are described by percolation phenomena 0-0] ■ 
When the sampling coverage is small, most of the nodes 
are disconnected and belong to the DISC and TEND 
components. As sampling coverage increases past some 
percolation threshold, the GSCC emerges quickly and the 
IN and OUT components form concurrently, as shown in
Fig. 11(b).



IV. SUMMARY AND DISCUSSION 

In summary, a comparison of BFS sampling to random
sampling indicates that differences in sampling method
and coverage can introduce biases that result in substantial
mischaracterization of the statistics of many structural
properties in directed networks. Moreover, the extent
to which sampling biases will affect these properties
seems to depend heavily on the structure of the original
network. In comparing random sampling to BFS
sampling on seven different directed networks, including
three versions of Wikipedia, three different sources of
sampled World Wide Web data, and an Internet-based
social network, we found that differences in sampling
method and coverage affect the bow-tie structure as
well as the number and surface structure of strongly
connected components in sampled networks. In addition, at
low sampling coverage (less than 40%), the values of
average degree, variance of in- and out-degree, degree
auto-correlation, and link reciprocity in sampled networks are
misestimated by at least 30%, and sometimes by as much
as four orders of magnitude. The structural properties of
BFS-sampled networks attain values within 10% of the
corresponding values in the original networks only when
sampling coverage is in excess of 65%.

Most biases under random sampling seem to stem from
the fact that both out-degree and in-degree will be
approximately equally undersampled. This leads to
underestimation of average degree and variances of in- and
out-degree. At the same time, properties such as reciprocity
and auto-correlation are essentially constant because of
this equality in undersampling. Biases under BFS sampling
arise from a confluence of factors: by following only
outgoing links, BFS fails to cover the IN-component of
directed networks; BFS covers nodes of high in-degree at
early times; the core of a BFS-sampled network is tangled
with many loops, showing high clustering; and the in-
and out-degrees of nodes at the growing front are
undersampled under BFS sampling. In combination, these
factors (and possibly others) lead to overestimation of
some structural properties (average degree, variance in
out-degree, auto-correlation, and reciprocity) and
underestimation of others (variance in in-degree, number of
SCCs, surface node ratios). We have demonstrated that,
for these reasons, if uniform random or BFS sampling is
used to assemble a network, significant corrections to
degree, degree variance, auto-correlation, reciprocity, some
types of assortativity, and component make-up should be
expected.

Though we have not examined it here, we suspect that
there may be an important interplay between sampling
method, sampling coverage, temporal changes, and sampled
network topologies. The Wikipedia data discussed
earlier could be used to probe such effects, since it
captures snapshots of Wikipedia at different times during the
network's evolution. It would be interesting to quantify
differences in the effects (if any) of BFS and random
sampling on time-varying or temporal networks [58]. A natural
question, after analyzing the drawbacks of sampling
procedures, is how we can overcome such problems
to obtain unbiased network samples. A possible solution
could be a combination of random and BFS sampling
to obtain several unbiased structural properties. However,
it is still challenging to obtain samples that are unbiased
for every network property. There are several papers
suggesting unbiased sampling strategies for specific
properties [HI!!.

FIG. 11. (Color online) For Wiki2007 data, component ratios in the sampled networks (a), (b), and percentages sampled
from the components in the original network (c), (d). (a) It is surprising that most of the network is a GSCC even at very low
coverage, indicating the importance of loops and clustering around high in-degree nodes. The OUT component ratio slightly
decreases as $\alpha$ increases since the number of nodes in the GSCC increases more quickly than the number of nodes in the OUT.
(c) Conversely, the percentage sampled from the OUT increases. As expected, the other components shrink as a power of
$1/N \sim \alpha^{-1}$. (d) In the case of random sampling, the percentage sampled from each component is almost constant. (b) On
the other hand, the component ratio changes substantially as the sampled coverage increases. At small coverage, most of the
nodes are disconnected, but after a percolation threshold has been reached, the GSCC emerges quickly, absorbing the other
components. The labels for (c), (d) are the same as those of (a), (b) and the numbers in parentheses are the component ratio
of the original network.

The results presented in this paper have widespread
implications for conclusions that have been drawn
regarding the structure (and function) of some of the most
ubiquitously studied real-world networks, including the
World Wide Web. Since for many of the real directed
networks studied only an incomplete link list is available, either
because the networks are too large to be fully recorded,
or because they change too quickly to be captured by
any sampling procedure, our findings call into question
the accuracy of previously reported results for the
statistics of some of these networks' structural properties. We
may not know as much about the structure of very large
directed networks as has been supposed.

ACKNOWLEDGMENTS 

This work was partially supported by the research fund 
of Hanyang University (HY-2012-N) (S.-W.S.). 



[1] S. H. Strogatz, Nature 410, 268 (2001).
[2] R. Albert and A.-L. Barabasi, Rev. Mod. Phys. 74, 47 (2002).
[3] S. N. Dorogovtsev and J. F. F. Mendes, Adv. Phys. 51, 1079 (2002).
[4] M. E. J. Newman, SIAM Rev. 45, 167 (2003).
[5] R. Albert, H. Jeong, and A.-L. Barabasi, Nature 406, 378 (2000).
[6] R. Cohen, K. Erez, D. ben-Avraham, and S. Havlin, Phys. Rev. Lett. 85, 4626 (2000).
[7] R. Cohen, K. Erez, D. ben-Avraham, and S. Havlin, Phys. Rev. Lett. 86, 3682 (2001).
[8] E. H. Davidson et al., Science 295, 1669 (2002).
[9] H. Jeong, S. P. Mason, A.-L. Barabasi, and Z. N. Oltvai, Nature 411, 41 (2001).
[10] C. von Mering et al., Nature 417, 399 (2002).
[11] K.-I. Goh et al., Proc. Natl. Acad. Sci. USA 104, 8685 (2007).
[12] M. A. Yildirim et al., Nat. Biotechnol. 25, 1119 (2007).
[13] P. Grassberger, Math. Biosci. 63, 157 (1983).
[14] R. Pastor-Satorras and A. Vespignani, Phys. Rev. Lett. 86, 3200 (2001).
[15] R. Pastor-Satorras and A. Vespignani, Phys. Rev. E 63, 066117 (2001).
[16] D. H. Zanette, Phys. Rev. E 64, 050901(R) (2001).
[17] M. Kuperman and G. Abramson, Phys. Rev. Lett. 86, 2909 (2001).
[18] Y. Moreno, M. Nekovee, and A. Vespignani, Phys. Rev. E 69, 055101(R) (2004).
[19] P. G. Lind, L. R. da Silva, J. S. Andrade, Jr., and H. J. Herrmann, Phys. Rev. E 76, 036117 (2007).
[20] D.-U. Hwang, M. Chavez, A. Amann, and S. Boccaletti, Phys. Rev. Lett. 94, 138701 (2005).
[21] T. Nishikawa and A. E. Motter, Phys. Rev. E 73, 065106(R) (2006).
[22] S. M. Park and B. J. Kim, Phys. Rev. E 74, 026114 (2006).
[23] S.-W. Son, B. J. Kim, H. Hong, and H. Jeong, Phys. Rev. Lett. 103, 228702 (2009).
[24] A. Ntoulas, J. Cho, and C. Olston, in WWW '04: Proceedings of the 13th international conference on World Wide Web (2004).
[25] The size of the World Wide Web (The Internet) - http://www.worldwidewebsize.com/
[26] S. K. Thomson, Sampling (Wiley) (2002).
[27] M. E. J. Newman, Soc. Networks 25, 83 (2003).
[28] L. Page, S. Brin, R. Motwani, and T. Winograd, Technical Report, Stanford InfoLab (1999).
[29] S.-W. Son, C. Christensen, P. Grassberger, and M. Paczuski, e-print arXiv:1201.4787 (2012).
[30] M. P. H. Stumpf, C. Wiuf, and R. M. May, Proc. Natl. Acad. Sci. USA 102, 4221 (2005).
[31] M. P. H. Stumpf and C. Wiuf, Phys. Rev. E 72, 036118 (2005).
[32] S. H. Lee, P. J. Kim, and H. Jeong, Phys. Rev. E 73, 016102 (2006).
[33] L. Goodman, Annals of Mathematical Statistics 32, 148-170 (1961).
[34] M. Kurant, A. Markopoulou, and P. Thiran, e-print arXiv:1004.1729 (2010).
[35] M. Kurant, A. Markopoulou, and P. Thiran, e-print arXiv:1102.4599 (2011).
[36] D. E. Knuth, The Art of Computer Programming Vol 1 3rd ed. (Addison-Wesley, Boston) (1997).
[37] S. N. Dorogovtsev, J. F. F. Mendes, and A. N. Samukhin, Phys. Rev. E 64, 025101(R) (2001).
[38] D. Garlaschelli and M. Loffredo, Phys. Rev. Lett. 93, 268701 (2004).
[39] G. Bianconi, N. Gulbache, A. E. Motter, Phys. Rev. Lett. 100, 119701 (2008).
[40] E. A. Leicht, M. E. J. Newman, Phys. Rev. Lett. 100, 118703 (2008).
[41] D. Donato, S. Leonardi, S. Millozzi, and P. Tsaparas, J. Phys. A: Math. Theor. 41, 224017 (2008).
[42] L. Becchetti, C. Castillo, D. Donato, and A. Fazzone, in Proceedings of the Workshop on Link Analysis (LinkKDD '06) (2006).
[43] T. Want, Y. Chen, Z. Zhang, P. Sun, B. Deng, and X. Li, in Proceedings of SigComm (2010).
[44] A. Broder et al., Computer Networks 33, 309 (2000).
[45] R. Tarjan, SIAM J. Comput. 1, 146 (1972).
[46] M. Levene and Poulovassilis, Web dynamics: adapting to change in content, size, topology and use (Springer) (2004).
[47] Stanford Large Network Dataset Collection - http://snap.stanford.edu/data/index.html
[48] J. Leskovec, K. Lang, A. Dasgupta, and M. Mahoney, e-print arXiv:0810.1355 (2008).
[49] RaySoda.co.kr - http://www.raysoda.co.kr/
[50] Wikipedia.org - http://www.wikipedia.org/
[51] University of Florida Sparse Matrix Collection: Gleich group - http://www.cise.ufl.edu/research/sparse/matrices/Gleich/index.html
[52] Articles comprise about 50% of pages on Wikipedia, and contain content specific to a single topic. Category, portal, and disambiguation pages are organizational pages that sustain Wikipedia's structure. Topics that fall in a certain category should link to that category. For example, the category page, Mathematics and logic, acts as a high-level organizational page to which subtopic pages including Algebra, Numbers, Trigonometry, etc. link. Portals are top-level introductory pages for specific article topics or areas of interest. For example, Portal: Canada contains a brief introduction to Canada, a Canadian news feed, a table of contents of Wikipedia articles relating to Canada, etc. Topics related to a portal are not required to link to the portal. Disambiguation pages arise when a term refers to the title of more than one Wikipedia article. For example, a disambiguation page exists for the topic Mercury, since three articles have Mercury as a title (Mercury (element), Mercury (planet), Mercury (mythology)). Redirect pages, on the other hand, do not contain content, but merely route readers elsewhere. They may be encountered, for instance, when a word is misspelled. We note that our Wikipedia networks are about twice as large as the Wikipedia networks in [53].
[53] L. Buriol, C. Castillo, D. Donato, S. Leonardi, and S. A. Millozzi, in Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence (2006).
[54] R. Albert, H. Jeong, and A.-L. Barabasi, Nature 401, 130 (1999).
[55] J. G. Foster, D. V. Foster, P. Grassberger, and M. Paczuski, Proc. Natl. Acad. Sci. USA 107, 10815 (2010).
[56] The average $\bar{j}^x$ is different from $\langle k \rangle$ because the latter is an average over random nodes, whereas in $\bar{j}^x$ the nodes are chosen as end points of random links.
[57] G. Zamora-Lopez et al., Phys. Rev. E 77, 016106 (2008).
[58] P. Holme and J. Saramaki, to appear in Physics Reports, e-print arXiv:1108.1780 (2011).



